URL normalization

URL normalization (also called URL canonicalization) is the process of modifying and standardizing URLs in a consistent manner. The goal of normalization is to transform a URL into a normalized or canonical form so that it is possible to determine whether two syntactically different URLs are equivalent.

Search engines employ URL normalization in order to assign importance to web pages and to reduce indexing of duplicate pages. Web crawlers perform URL normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine whether a link has been visited or whether a page has been cached.

Normalization process

There are several types of normalization that may be performed. Some of them are semantics preserving and some are not.

Normalizations that preserve semantics

The following normalizations are described in RFC 3986 [1] to result in equivalent URLs:

Converting the scheme and host to lower case: HTTP://www.Example.com/ → http://www.example.com/
Capitalizing letters in percent-encodings: http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b
Decoding percent-encoded octets of unreserved characters:[2] http://www.example.com/%7Eusername/ → http://www.example.com/~username/
Converting an empty path to a "/" path: http://www.example.com → http://www.example.com/
Removing the default port: http://www.example.com:80/bar.html → http://www.example.com/bar.html
Removing dot-segments: http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html

These normalizations can be applied to URLs without changing their semantics.
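
A minimal Python sketch of these semantics-preserving steps is given below. It is an illustrative outline using the standard urllib.parse module, not a complete RFC 3986 implementation, and the helper names are our own:

    import re
    from urllib.parse import urlsplit, urlunsplit

    UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "abcdefghijklmnopqrstuvwxyz0123456789-._~")

    def normalize_percent_encoding(text):
        # Uppercase the hex digits of percent-encodings and decode octets
        # that correspond to unreserved characters (RFC 3986, 2.3 and 6.2.2).
        def fix(match):
            octet = chr(int(match.group(1), 16))
            return octet if octet in UNRESERVED else "%" + match.group(1).upper()
        return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)

    def remove_dot_segments(path):
        # Simplified version of RFC 3986, 5.2.4; it ignores some edge cases,
        # such as preserving a trailing slash after a final "." or "..".
        output = []
        for segment in path.split("/"):
            if segment == ".":
                continue
            elif segment == "..":
                if len(output) > 1:
                    output.pop()
            else:
                output.append(segment)
        return "/".join(output)

    def normalize(url):
        parts = urlsplit(url)                    # urlsplit lowercases the scheme
        host = (parts.hostname or "").lower()    # lowercase the host
        default_ports = {"http": 80, "https": 443}
        port = parts.port
        # Drop the port when it is the default for the scheme
        # (userinfo is omitted in this sketch).
        netloc = host if port in (None, default_ports.get(parts.scheme)) \
            else f"{host}:{port}"
        path = remove_dot_segments(normalize_percent_encoding(parts.path)) or "/"
        query = normalize_percent_encoding(parts.query)
        return urlunsplit((parts.scheme, netloc, path, query, parts.fragment))

    # normalize("HTTP://www.Example.com:80/%7Eusername/../a/b/../c/./d.html")
    # -> "http://www.example.com/a/c/d.html"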

Normalizations that change semantics

Applying the following normalizations results in a semantically different URL, although the normalized URL may still refer to the same resource (a Python sketch of a few of these follows the list):

Removing the directory index: http://www.example.com/default.asp → http://www.example.com/
http://www.example.com/a/index.html → http://www.example.com/a/
Removing the fragment: http://www.example.com/bar.html#section1 → http://www.example.com/bar.html
Replacing an IP address with its domain name: http://208.77.188.166/ → http://www.example.com/
Limiting the protocol: https://www.example.com/ → http://www.example.com/
Removing duplicate slashes: http://www.example.com/foo//bar.html → http://www.example.com/foo/bar.html
Removing "www" as the first domain label: http://www.example.com/ → http://example.com/
Sorting the query parameters: http://www.example.com/display?lang=en&article=fred → http://www.example.com/display?article=fred&lang=en
However, Web servers differ in whether they allow the same variable to appear multiple times, and how this should be represented.[3]
Removing unused query variables: http://www.example.com/display?id=123&fakefoo=fakebar → http://www.example.com/display?id=123
Removing default query parameters: http://www.example.com/display?id=&sort=ascending → http://www.example.com/display
Removing the "?" when the query is empty: http://www.example.com/display? → http://www.example.com/display
Percent-encoding special characters in the query: http://www.example.com/display?category=foo/bar+baz → http://www.example.com/display?category=foo%2Fbar%2Bbaz
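
A few of these more aggressive transformations (dropping the fragment, collapsing duplicate slashes, sorting the query parameters, and removing an empty query) could be sketched in Python as follows; whether they are safe depends entirely on the target site:

    import re
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def aggressively_normalize(url):
        parts = urlsplit(url)
        # Collapse runs of slashes in the path into a single slash.
        path = re.sub(r"/{2,}", "/", parts.path)
        # Sort the query parameters; parse_qsl keeps repeated keys, and
        # urlencode re-serializes (and may re-encode) the pairs.
        query = urlencode(sorted(parse_qsl(parts.query)))
        # Passing an empty query and fragment to urlunsplit also drops a
        # lone trailing "?" and any "#fragment".
        return urlunsplit((parts.scheme, parts.netloc, path, query, ""))

    # aggressively_normalize("http://www.example.com/foo//bar.html#section1")
    # -> "http://www.example.com/foo/bar.html"
    # aggressively_normalize("http://www.example.com/display?lang=en&article=fred")
    # -> "http://www.example.com/display?article=fred&lang=en"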

Normalization based on URL lists

Some normalization rules may be developed for specific websites by examining URL lists obtained from previous crawls or web server logs. For example, if the URL

http://foo.org/story?id=xyz

appears in a crawl log several times along with

http://foo.org/story_xyz

we may assume that the two URLs are equivalent and normalize them to one of the two forms.
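
A site-specific rule of this kind could be applied as a simple rewrite. The following Python sketch assumes a hypothetical rule for foo.org learned from such a log:

    import re

    # Hypothetical rule mined from crawl logs: on foo.org, "/story?id=<x>"
    # and "/story_<x>" appear to serve the same page, so the former is
    # rewritten onto the latter, chosen here as the canonical form.
    STORY_RULE = re.compile(r"^http://foo\.org/story\?id=(\w+)$")

    def apply_site_rules(url):
        return STORY_RULE.sub(r"http://foo.org/story_\1", url)

    # apply_site_rules("http://foo.org/story?id=xyz") -> "http://foo.org/story_xyz"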

Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URL lists. They showed that once the correct DUST rules were found and applied with a canonicalization algorithm, they were able to find up to 68% of the redundant URLs in a URL list.

References

  1. ^ RFC 3986, Section 6: Normalization and Comparison
  2. ^ RFC 3986, Section 2.3: Unreserved Characters
  3. ^ http://benalman.com/news/2009/12/jquery-14-param-demystified/